8 research outputs found

    RANK-BASED TEMPO-SPATIAL CLUSTERING: A FRAMEWORK FOR RAPID OUTBREAK DETECTION USING SINGLE OR MULTIPLE DATA STREAMS

    In recent decades, algorithms for disease outbreak detection have become a major interest of public health practitioners, who aim to identify and localize an outbreak as early as possible so that a public health response can begin before a pandemic develops. Today’s increased threat of biological warfare and terrorism provides an even stronger impetus to develop methods for outbreak detection based on symptoms as well as definitive laboratory diagnoses. In this dissertation, I explore the problem of rapid disease outbreak detection using both spatial and temporal information. I develop a framework of non-parameterized algorithms that search for patterns of disease outbreak in spatial sub-regions of the monitored region within a certain period. Compared with existing spatial or tempo-spatial algorithms, the algorithms in this framework provide a methodology for fast searching of either univariate or multivariate data sets. The framework first measures which study areas are more likely to have an outbreak, given the baseline data and the currently observed data. It then applies a greedy search, using the risk measurement for each unit area as a heuristic, to look for clusters with high posterior probabilities. I also explore the performance of the proposed algorithms. From the perspective of predictive modeling, I adopt a Gamma-Poisson (GP) model to compute the probability of an outbreak in each cluster when analyzing univariate data. I build a multinomial generalized Dirichlet (MGD) model to identify outbreak clusters from multivariate data, which include the OTC data streams collected by the National Retail Data Monitor (NRDM) and the ED data streams collected by the RODS system.
Key contributions of this dissertation are as follows: 1) it introduces a rank-based tempo-spatial clustering algorithm, RSC, which uses greedy search and a Bayesian GP model for disease outbreak detection, with comparable detection timeliness and cluster positive predictive value (PPV) and improved running time; 2) it proposes a multivariate extension of RSC (MRSC) that applies the MGD model. The evaluation demonstrated that the MGD model can effectively suppress false alarms caused by elevated signals that are not disease-related and occur in all the monitored data streams.
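
The Gamma-Poisson scoring and risk ranking described in this abstract can be illustrated with a minimal sketch. It is not the dissertation's RSC implementation: it assumes each area's count is Poisson with mean equal to a relative risk times the baseline, places a Gamma prior on that risk, and compares a "no outbreak" prior against an "outbreak" prior. The function names, the prior parameters, and the prior outbreak probability are all hypothetical values chosen for illustration.

```python
import math


def log_gamma_poisson_marginal(count, baseline, alpha, beta):
    """Log marginal likelihood of `count` when count ~ Poisson(q * baseline)
    and the relative risk q ~ Gamma(shape=alpha, rate=beta)."""
    return (
        math.lgamma(alpha + count) - math.lgamma(alpha) - math.lgamma(count + 1)
        + alpha * math.log(beta / (beta + baseline))
        + count * math.log(baseline / (beta + baseline))
    )


def outbreak_posterior(count, baseline, prior_outbreak=0.05,
                       null_prior=(100.0, 100.0), outbreak_prior=(150.0, 100.0)):
    """Posterior probability of the outbreak hypothesis for one area.

    The (shape, rate) pairs are illustrative assumptions: the null prior
    centers the relative risk at 1.0, the outbreak prior at 1.5.
    """
    log_h0 = (log_gamma_poisson_marginal(count, baseline, *null_prior)
              + math.log(1.0 - prior_outbreak))
    log_h1 = (log_gamma_poisson_marginal(count, baseline, *outbreak_prior)
              + math.log(prior_outbreak))
    # Normalize in log space to avoid underflow for large counts.
    top = max(log_h0, log_h1)
    w0 = math.exp(log_h0 - top)
    w1 = math.exp(log_h1 - top)
    return w1 / (w0 + w1)


def rank_areas(observations):
    """Order area names from highest to lowest outbreak posterior.

    `observations` maps an area name to a (count, baseline) pair.
    """
    return sorted(
        observations,
        key=lambda area: outbreak_posterior(*observations[area]),
        reverse=True,
    )
```

A ranking such as `rank_areas({"A": (10, 10.0), "B": (30, 10.0)})` could then serve as the heuristic ordering that a greedy cluster search consumes, starting from the highest-risk area.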

    Parallel Feature Selection Using Only Counts

    Count queries belong to a class of summary statistics routinely used in basket analysis, inventory tracking, and study cohort finding. In this article, we demonstrate how it is possible to use simple count queries for parallelizing sequential data mining algorithms. Specifically, we parallelize a published algorithm for finding minimum sets of discriminating features and demonstrate that the parallel speedup is close to the expected optimum.
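
Count queries parallelize well because counts are additive: the counts over a whole data set equal the sum of the counts over its partitions. The sketch below illustrates only that property and is not the article's feature selection algorithm. The row layout (a tuple of feature values plus a class label) and the function names are assumptions made for the example. Threads keep the sketch self-contained; a real deployment would more likely spread the shards across processes or machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_shard(rows):
    """Count (feature_index, feature_value, label) triples in one shard.

    Each row is assumed to be a (features, label) pair, where `features`
    is a tuple of discrete values.
    """
    counts = Counter()
    for features, label in rows:
        for index, value in enumerate(features):
            counts[(index, value, label)] += 1
    return counts


def parallel_counts(shards, max_workers=4):
    """Answer the count query on every shard and add up the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_shard, shards):
            total.update(partial)  # Counter.update adds counts together
    return total
```

Because merging is a plain sum, the result does not depend on how the rows are split, so a feature selection step that needs only these counts can run unchanged on the merged table.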

    Modeling Baseline Shifts in Multivariate Disease Outbreak Detection

    Current outbreak detection algorithms that monitor a single data stream may be prone to false alarms due to baseline shifts, which can be caused by large local events such as festivals or Super Bowl games. In this paper, we propose a Multinomial-Generalized-Dirichlet (MGD) model to improve a previously developed spatial clustering algorithm, MRSC, by modeling baseline shifts. Our results show that MGD had better ROC and AMOC curves when baseline shifts were introduced. We conclude that MGD can be added to outbreak detection systems to reduce false alarms due to baseline shifts.

    Spatial and Temporal Algorithm Evaluation for Detecting Over-The-Counter Thermometer Sale Increases during 2009 H1N1 Pandemic

    Background: Spatial outbreak detection algorithms using routinely collected healthcare data have been developed since the late 1990s to identify and locate disease outbreaks. However, current well-received spatial algorithms assume that only one outbreak cluster is present at any point in time, which may not be valid during a pandemic, when several geographic clusters occur concurrently. Based on a retrospective evaluation of time-series and spatial algorithms, this paper suggests that time series analysis remains desirable for detecting pandemics and may achieve more sensitive performance with better timeliness. Methods: We first prove in theory that two existing spatial models, the likelihood ratio and the Bayesian spatial scan statistics, are not useful if multiple clusters occur at the same time in different geographic regions. We then compare a spatial algorithm, the Bayesian Spatial Scan Statistic (BSS), with a time series algorithm, the wavelet anomaly detector (WAD), on their performance in detecting the increase in over-the-counter (OTC) medicine sales during the 2009 H1N1 pandemic. Results: The experiments demonstrated that the Bayesian spatial algorithm responded to the increase in thermometer sales about 3 days later than the time series algorithm. Conclusion: Time-series algorithms demonstrated an advantage for early outbreak detection, especially when multiple clusters occur at the same time in different geographic regions. Given that spatial-temporal algorithms for outbreak detection are widely used, this paper suggests that epidemiologists and public health officials would benefit from applying time series algorithms as a complement to spatial algorithms for public health surveillance.

    Probabilistic Case Detection for Disease Surveillance Using Data in Electronic Medical Records

    This paper describes a probabilistic case detection system (CDS) that uses a Bayesian network model of medical diagnosis and natural language processing (NLP) to compute the posterior probability of influenza and influenza-like illness from emergency department dictated notes and laboratory results. The diagnostic accuracy of CDS for these conditions, as measured by the area under the ROC curve, was 0.97, and the overall accuracy of the NLP employed in CDS was 0.91.